Review for NeurIPS paper: Differentiable Meta-Learning of Bandit Policies

Neural Information Processing Systems

As in standard policy-gradient methods, two key parameters appear to be the batch size m and the horizon n. It would be good to provide a sensitivity analysis on these parameters to better assess how the approach scales to complex problems. In particular, what is the effect of the horizon on the gradient estimation? Does the variance blow up, or is the baseline sufficient to keep it under control? In this sense, it might be good to have differentiable strategies that are provably efficient (e.g., with sub-linear regret) for a range of parameter values, so that whatever value of \theta we encounter during optimization will not perform poorly.


Review for NeurIPS paper: Differentiable Meta-Learning of Bandit Policies


The rebuttal helped clarify the questions raised in the reviews. The consensus reached in the discussion is that this is a borderline-plus paper. The reviewers appreciate the contribution's practicality, relevance, and usefulness; at the same time, they remain concerned about the narrow scope and would rather have seen the policy-gradient method applied to parameterized policies for more complex learning problems. One question the rebuttal did not answer successfully concerns the setup in the experiments section, where the two-level learning process remained confusing. On the whole, this is a worthwhile addition to the program.


Differentiable Meta-Learning of Bandit Policies


Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution P. In this work, we learn such policies for an unknown distribution P using samples from P. Our approach is a form of meta-learning and exploits properties of P without making strong assumptions about its form. To do this, we parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is pleasantly general and easy to implement. We derive effective gradient estimators and propose novel variance reduction techniques. We also analyze and experiment with various bandit policy classes, including neural networks and a novel softmax policy. The latter has regret guarantees and is a natural starting point for our optimization.
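The approach the abstract describes can be illustrated with a minimal sketch: a softmax bandit policy with a single differentiable parameter theta (an inverse temperature over empirical arm means), meta-trained by a score-function (REINFORCE-style) gradient with a mean-reward baseline for variance reduction, averaged over a batch of m Bernoulli bandit instances drawn from a distribution P. All names, the choice of P (uniform arm means), and the scalar parameterization are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_episode(theta, means, n=100):
    """Play one Bernoulli bandit instance with a softmax policy over
    empirical means. Returns the total reward and the accumulated
    score d/d_theta sum_t log pi_theta(a_t), used for the gradient.
    (Illustrative sketch; not the paper's exact policy class.)"""
    K = len(means)
    counts = np.ones(K)   # optimistic initialization avoids division by zero
    sums = np.ones(K)
    total_reward, score = 0.0, 0.0
    for _ in range(n):
        mu_hat = sums / counts
        logits = theta * mu_hat
        p = np.exp(logits - logits.max())
        p /= p.sum()
        a = rng.choice(K, p=p)
        # For softmax logits theta * mu_hat:
        # d log pi(a) / d theta = mu_hat[a] - E_p[mu_hat]
        score += mu_hat[a] - p @ mu_hat
        r = float(rng.random() < means[a])
        sums[a] += r
        counts[a] += 1
        total_reward += r
    return total_reward, score

def meta_gradient_step(theta, m=32, K=5, lr=0.01):
    """One policy-gradient step over m instances sampled from P
    (here assumed uniform over arm means), with a batch-mean baseline."""
    rewards, scores = zip(*(run_episode(theta, rng.random(K)) for _ in range(m)))
    rewards = np.array(rewards)
    baseline = rewards.mean()   # simple variance-reduction baseline
    grad = np.mean((rewards - baseline) * np.array(scores))
    return theta + lr * grad

theta = 1.0   # natural starting point: moderate exploration
for _ in range(20):
    theta = meta_gradient_step(theta)
```

The batch size m and horizon n discussed in the first review appear directly as the two sampling loops; the baseline subtraction is what keeps the variance of the score-function estimator manageable as n grows.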